1 Rents in San Francisco 2000-2018

We are analysing a dataset of Craigslist listings for rental properties in the greater San Francisco area. The data dictionary is as follows:

variable class description
post_id character Unique ID
date double date
year double year
nhood character neighborhood
city character city
county character county
price double price in USD
beds double n of beds
baths double n of baths
sqft double square feet of rental
room_in_apt double room in apartment
address character address
lat double latitude
lon double longitude
title character title of listing
descr character description
details character additional details

The dataset was used in a recent tidyTuesday project, which is where we sourced the data.

# download directly from the TidyTuesday GitHub repo

rent <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv')

We have summarised the variables; view the analysis below.

skimr::skim(rent)
Data summary
Name rent
Number of rows 200796
Number of columns 17
_______________________
Column type frequency:
character 8
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
post_id 0 1.00 9 14 0 200796 0
nhood 0 1.00 4 43 0 167 0
city 0 1.00 5 19 0 104 0
county 1394 0.99 4 13 0 10 0
address 196888 0.02 1 38 0 2869 0
title 2517 0.99 2 298 0 184961 0
descr 197542 0.02 13 16975 0 3025 0
details 192780 0.04 4 595 0 7667 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
date 0 1.00 2.01e+07 44694.07 2.00e+07 2.01e+07 2.01e+07 2.01e+07 2.02e+07 ▁▇▁▆▃
year 0 1.00 2.01e+03 4.48 2.00e+03 2.00e+03 2.01e+03 2.01e+03 2.02e+03 ▁▇▁▆▃
price 0 1.00 2.14e+03 1427.75 2.20e+02 1.30e+03 1.80e+03 2.50e+03 4.00e+04 ▇▁▁▁▁
beds 6608 0.97 1.89e+00 1.08 0.00e+00 1.00e+00 2.00e+00 3.00e+00 1.20e+01 ▇▂▁▁▁
baths 158121 0.21 1.68e+00 0.69 1.00e+00 1.00e+00 2.00e+00 2.00e+00 8.00e+00 ▇▁▁▁▁
sqft 136117 0.32 1.20e+03 5000.22 8.00e+01 7.50e+02 1.00e+03 1.36e+03 9.00e+05 ▇▁▁▁▁
room_in_apt 0 1.00 0.00e+00 0.04 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.00e+00 ▇▁▁▁▁
lat 193145 0.04 3.77e+01 0.35 3.36e+01 3.74e+01 3.78e+01 3.78e+01 4.04e+01 ▁▁▅▇▁
lon 196484 0.02 -1.22e+02 0.78 -1.23e+02 -1.22e+02 -1.22e+02 -1.22e+02 -7.42e+01 ▇▁▁▁▁
str(rent)
## spec_tbl_df [200,796 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ post_id    : chr [1:200796] "pre2013_134138" "pre2013_135669" "pre2013_127127" "pre2013_68671" ...
##  $ date       : num [1:200796] 20050111 20050126 20041017 20120601 20041021 ...
##  $ year       : num [1:200796] 2005 2005 2004 2012 2004 ...
##  $ nhood      : chr [1:200796] "alameda" "alameda" "alameda" "alameda" ...
##  $ city       : chr [1:200796] "alameda" "alameda" "alameda" "alameda" ...
##  $ county     : chr [1:200796] "alameda" "alameda" "alameda" "alameda" ...
##  $ price      : num [1:200796] 1250 1295 1100 1425 890 ...
##  $ beds       : num [1:200796] 2 2 2 1 1 1 1 3 NA 2 ...
##  $ baths      : num [1:200796] 2 NA NA NA NA NA 1 NA 1 NA ...
##  $ sqft       : num [1:200796] NA NA NA 735 NA NA NA NA NA NA ...
##  $ room_in_apt: num [1:200796] 0 0 0 0 0 0 0 0 0 0 ...
##  $ address    : chr [1:200796] NA NA NA NA ...
##  $ lat        : num [1:200796] NA NA NA NA NA NA NA NA NA NA ...
##  $ lon        : num [1:200796] NA NA NA NA NA NA NA NA NA NA ...
##  $ title      : chr [1:200796] "$1250 / 2br - 2BR/2BA   1145 ALAMEDA DE LAS PULGAS" "$1295 / 2br - Walk the Beach! 1 FREE MONTH + $500 TRADER JOES SHOPPING CERTIFICATE" "$1100 / 2br - cottage" "$1425 / 1br - 735ft² - BEST LOCATION SOUTHSHORE GARDENS APARTMENTS" ...
##  $ descr      : chr [1:200796] NA NA NA NA ...
##  $ details    : chr [1:200796] NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   post_id = col_character(),
##   ..   date = col_double(),
##   ..   year = col_double(),
##   ..   nhood = col_character(),
##   ..   city = col_character(),
##   ..   county = col_character(),
##   ..   price = col_double(),
##   ..   beds = col_double(),
##   ..   baths = col_double(),
##   ..   sqft = col_double(),
##   ..   room_in_apt = col_double(),
##   ..   address = col_character(),
##   ..   lat = col_double(),
##   ..   lon = col_double(),
##   ..   title = col_character(),
##   ..   descr = col_character(),
##   ..   details = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(rent)
## # A tibble: 6 × 17
##   post_id          date  year nhood city  county price  beds baths  sqft room_…¹
##   <chr>           <dbl> <dbl> <chr> <chr> <chr>  <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 pre2013_134138 2.01e7  2005 alam… alam… alame…  1250     2     2    NA       0
## 2 pre2013_135669 2.01e7  2005 alam… alam… alame…  1295     2    NA    NA       0
## 3 pre2013_127127 2.00e7  2004 alam… alam… alame…  1100     2    NA    NA       0
## 4 pre2013_68671  2.01e7  2012 alam… alam… alame…  1425     1    NA   735       0
## 5 pre2013_127580 2.00e7  2004 alam… alam… alame…   890     1    NA    NA       0
## 6 pre2013_152345 2.01e7  2006 alam… alam… alame…   825     1    NA    NA       0
## # … with 6 more variables: address <chr>, lat <dbl>, lon <dbl>, title <chr>,
## #   descr <chr>, details <chr>, and abbreviated variable name ¹​room_in_apt

Most column types match the data dictionary, but date is stored as a double (e.g. 20050111) instead of a date (2005-01-11). Other columns, such as the number of bedrooms (beds), are doubles where an integer would suffice; converting them to integer would use less storage. descr has the most missing values, followed by address and details, each with more than 190,000 missing values.
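As a quick sketch of this cleanup (not run in this report; rent_demo is a hypothetical two-row stand-in for the full rent tibble), the double-coded dates can be parsed with lubridate::ymd() and beds downcast to integer:

```r
library(dplyr)
library(lubridate)

# Hypothetical two-row stand-in for the rent tibble
rent_demo <- tibble(date = c(20050111, 20041017), beds = c(2, 1))

rent_demo <- rent_demo %>%
  mutate(date = ymd(date),          # double 20050111 -> Date 2005-01-11
         beds = as.integer(beds))   # double -> integer, uses less storage
```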

We have plotted the top 20 cities in terms of % of rental listings.

# creating a dataset with the top 20 cities by number of listings
top_20_cities <- rent %>%
  group_by(city) %>%
  filter(year < 2019) %>% # ensuring we have the right years
  summarize(total_listings = n()) %>%
  mutate(percentage = total_listings/sum(total_listings), # changing number to a percentage
         city = fct_reorder(city, total_listings)) %>% #ordering by # listings
  slice_max(total_listings, n=20) 

# graphing the top 20 cities as a barplot
ggplot(top_20_cities) +
  aes(x = percentage, y = city) +
  geom_col() +
  labs(
    title = 'San Francisco accounts for more than a quarter of all rental classifieds',
    subtitle = '% of Craigslist listings, 2000-2018',
    x = NULL,
    y = NULL,
    caption = 'Source: Pennington, Kate (2018). Bay Area Craigslist Rental Housing Posts, 2000-2018'
  ) +
  scale_x_continuous(labels = scales::percent) +
  theme_minimal(base_size=16)

It is clear that San Francisco accounts for more than 25% of the listings, making it the most interesting city to start investigating. To analyse what is happening with rental prices in San Francisco, we plot the evolution of median prices for 0-, 1-, 2- and 3-bedroom listings.

sf_rentals <- rent %>%
  filter(city == 'san francisco',
         beds < 4) %>% 
  mutate(beds = factor(beds)) %>%
  group_by(beds, year) %>%
  summarise(rent = median(price))
  

ggplot(sf_rentals) +
  aes(x = year, y = rent, color = beds) +
  facet_wrap(vars(beds), nrow = 1) +
  geom_line() +
  theme(legend.position = 'none') +
  labs(
    title = 'San Francisco rents have been steadily increasing',
    subtitle = '0 to 3-bed listings, 2000-2018',
    x = NULL,
    y = NULL,
    caption = 'Source: Pennington, Kate(2018). Bay Area Craigslist Rental Housing Posts, 2000-2018',
  )

We see that rents have been increasing since 2005, except for a dip around the 2008 recession. Since 2015 we see another decrease in rents across all sizes.

Having considered San Francisco, we turn to analyse the top 12 cities by number of listings in the Bay Area. We plot the median rental prices for those cities below.

# determining the top 12 cities in terms of listings
top12_cities <- rent %>%
  group_by(city) %>%
  summarize(total_listings = n()) %>%
  slice_max(total_listings, n=12)

# creating a vector with the city names
top12_cities <- top12_cities$city

# gathering the dataset to plot, focusing on 1-bedroom flats
one_bed_bay_area <- rent %>%
  filter(beds == 1,
         city %in% top12_cities) %>%
  group_by(city, year) %>%
  summarize(rent = median(price))

# creating the plot
ggplot(one_bed_bay_area) +
  aes(x = year, y = rent, colour = city) +
  facet_wrap(vars(city)) +
  geom_line() +
  theme( legend.position = 'none') +
  labs(
    title = 'Rental prices for 1-bedroom flats in the Bay Area',
    x = NULL,
    y = NULL,
    caption = 'Source: Pennington, Kate(2018). Bay Area Craigslist Rental Housing Posts, 2000-2018',
  )

We can clearly spot the 2008 financial crisis, when prices fall across the cities. The effect is smallest in Santa Rosa and Oakland, where prices also did not rise much ahead of the recession, so a limited decrease is not surprising. The effect is greatest where the preceding increase was also greatest, as in San Francisco and Santa Clara.

2 Analysis of movies - IMDB dataset

We are now analysing a dataset of 5,000 movies from IMDB. We will analyse the differences between genres in terms of ratings, popularity (Facebook likes), and revenue.

movies <- read_csv(here::here('data',"movies.csv"))

The movies dataset was imported and inspected using skimr; no missing values were identified. When checking for duplicates, we found 54 duplicated titles, which may indicate duplicated entries within the dataset.
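The duplicate check described above can be sketched as follows (films_demo is a hypothetical stand-in; on the real data you would count title within movies):

```r
library(dplyr)

# Hypothetical stand-in for the movies tibble
films_demo <- tibble(title = c("Heat", "Heat", "Alien"))

# Titles that appear more than once
dup_titles <- films_demo %>%
  count(title) %>%
  filter(n > 1)
```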

2.0.1 Table showing the number of IMDB movies per genre

movies_by_genre <- movies %>% 
  group_by(genre) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))
  
movies_by_genre
## # A tibble: 17 × 2
##    genre       count
##    <chr>       <int>
##  1 Comedy        848
##  2 Action        738
##  3 Drama         498
##  4 Adventure     288
##  5 Crime         202
##  6 Biography     135
##  7 Horror        131
##  8 Animation      35
##  9 Fantasy        28
## 10 Documentary    25
## 11 Mystery        16
## 12 Sci-Fi          7
## 13 Family          3
## 14 Musical         2
## 15 Romance         2
## 16 Western         2
## 17 Thriller        1

2.0.2 Table showing movie revenue indicators per genre

movies %>% 
  mutate(return_on_budget = gross/budget) %>% 
  group_by(genre) %>% 
  summarize(average_gross = mean(gross), average_budget = mean(budget),  average_return_on_budget = mean(return_on_budget)) %>% 
  arrange(desc(average_return_on_budget)) 
## # A tibble: 17 × 4
##    genre       average_gross average_budget average_return_on_budget
##    <chr>               <dbl>          <dbl>                    <dbl>
##  1 Horror          37713738.      13504916.                 88.3    
##  2 Biography       45201805.      28543696.                 22.3    
##  3 Musical         92084000        3189500                  18.8    
##  4 Family         149160478.      14833333.                 14.1    
##  5 Documentary     17353973.       5887852.                  8.70   
##  6 Western         20821884        3465000                   7.06   
##  7 Fantasy         42408841.      17582143.                  6.68   
##  8 Animation       98433792.      61701429.                  5.01   
##  9 Comedy          42630552.      24446319.                  3.71   
## 10 Mystery         67533021.      39218750                   3.27   
## 11 Romance         31264848.      25107500                   3.17   
## 12 Drama           37465371.      26242933.                  2.95   
## 13 Adventure       95794257.      66290069.                  2.41   
## 14 Crime           37502397.      26596169.                  2.17   
## 15 Action          86583860.      71354888.                  1.92   
## 16 Sci-Fi          29788371.      27607143.                  1.58   
## 17 Thriller            2468         300000                   0.00823

2.0.3 Table showing the top 15 directors ranked by total gross revenue

movies %>% 
  group_by(director) %>% 
  summarize(highest_gross_revenue = sum(gross), mean_gross_revenue = mean(gross), median_gross_revenue = median(gross), standard_deviation = sd(gross)) %>% 
  slice_max(highest_gross_revenue, n = 15)
## # A tibble: 15 × 5
##    director          highest_gross_revenue mean_gross_revenue median_g…¹ stand…²
##    <chr>                             <dbl>              <dbl>      <dbl>   <dbl>
##  1 Steven Spielberg             4014061704         174524422. 164435221   1.01e8
##  2 Michael Bay                  2231242537         171634041. 138396624   1.27e8
##  3 Tim Burton                   2071275480         129454718.  76519172   1.09e8
##  4 Sam Raimi                    2014600898         201460090. 234903076   1.62e8
##  5 James Cameron                1909725910         318287652. 175562880.  3.09e8
##  6 Christopher Nolan            1813227576         226653447  196667606.  1.87e8
##  7 George Lucas                 1741418480         348283696  380262555   1.46e8
##  8 Robert Zemeckis              1619309108         124562239. 100853835   9.13e7
##  9 Clint Eastwood               1378321100          72543216.  46700000   7.55e7
## 10 Francis Lawrence             1358501971         271700394. 281666058   1.35e8
## 11 Ron Howard                   1335988092         111332341  101587923   8.19e7
## 12 Gore Verbinski               1329600995         189942999. 123207194   1.54e8
## 13 Andrew Adamson               1137446920         284361730  279680930.  1.21e8
## 14 Shawn Levy                   1129750988         102704635.  85463309   6.55e7
## 15 Ridley Scott                 1128857598          80632686.  47775715   6.88e7
## # … with abbreviated variable names ¹​median_gross_revenue, ²​standard_deviation

2.1 Graphics showing the spread of IMDB ratings per genre

2.1.1 Table showing the summary statistics for IMDB movie ratings

We have added both a density plot and a box plot to represent visually how ratings are distributed. We believe the box plot represents the distribution of ratings per genre more accurately, and it also makes comparisons across genres easier.

# summarizing the dataset by genre
data <- movies %>% 
  group_by(genre) %>% 
  summarize(mean = mean(rating), min = min(rating), max(rating), median = median(rating), standard_dev = sd(rating))

data
## # A tibble: 17 × 6
##    genre        mean   min `max(rating)` median standard_dev
##    <chr>       <dbl> <dbl>         <dbl>  <dbl>        <dbl>
##  1 Action       6.23   2.1           9     6.3         1.03 
##  2 Adventure    6.51   2.3           8.6   6.6         1.09 
##  3 Animation    6.65   4.5           8     6.9         0.968
##  4 Biography    7.11   4.5           8.9   7.2         0.760
##  5 Comedy       6.11   1.9           8.8   6.2         1.02 
##  6 Crime        6.92   4.8           9.3   6.9         0.849
##  7 Documentary  6.66   1.6           8.5   7.4         1.77 
##  8 Drama        6.73   2.1           8.8   6.8         0.917
##  9 Family       6.5    5.7           7.9   5.9         1.22 
## 10 Fantasy      6.15   4.3           7.9   6.45        0.959
## 11 Horror       5.83   3.6           8.5   5.9         1.01 
## 12 Musical      6.75   6.3           7.2   6.75        0.636
## 13 Mystery      6.86   4.6           8.5   6.9         0.882
## 14 Romance      6.65   6.2           7.1   6.65        0.636
## 15 Sci-Fi       6.66   5             8.2   6.4         1.09 
## 16 Thriller     4.8    4.8           4.8   4.8        NA    
## 17 Western      5.7    4.1           7.3   5.7         2.26
#We have added an additional visualisation of the distribution of ratings
ggplot(movies) + 
  aes(x = rating, y = genre) + 
  geom_boxplot() + 
  labs(
    title = 'Distribution of IMDB movie ratings is largely uniform across genres',
    subtitle = 'Box plot showing the variation of IMDB genre ratings',
    x = 'Ratings',
    y = 'Genres',
    caption = 'Source: Kaggle IMDB 5000 movie dataset'
  )

#Required visualisation
ggplot(movies) + 
  aes(rating) + 
  geom_density() +
  facet_wrap(vars(genre)) + 
  labs(
    title = 'Distribution of IMDB movie ratings is largely uniform across genres',
    subtitle = 'Density plot showing the variation in IMDB genre ratings',
    x = 'Ratings',
    y = 'Density',
    caption = 'Source: Kaggle IMDB 5000 movie dataset'
  )

The number of Facebook likes received by a movie's cast is not a good predictor of how much money the movie will make at the box office: movies whose casts received similar numbers of likes earned vastly different amounts.

# creating a scatterplot of the likes and revenue
ggplot(movies) +
  aes(x = cast_facebook_likes, y = gross) +
  geom_point() +
  scale_x_log10(labels = comma) +
  scale_y_continuous(labels = dollar) + #We changed the scale of the visualisation to make it more presentable
  labs(
    title = "Number of Cast's Facebook Likes is not a Good Predictor of Gross Movie Earnings",
    subtitle = 'Relationship Between Facebook Likes Received by Cast and Total US Earnings',
    x = "# Facebook Likes Cast Members Received",
    y = "Gross Earnings in the US Box Office, not Inflation-adjusted",
    caption = 'Source: Kaggle IMDB 5000 movie dataset')

Gross US earnings could be predicted from a movie's budget, as the two variables display a positive relationship.

# creating a scatterplot to analyse the revenue and ratings
ggplot(movies) +
  aes(x = budget, y = gross) +
  geom_point() +
  scale_x_continuous(labels = dollar) + 
  scale_y_continuous(labels = dollar) +
  geom_smooth(method='lm') + #We added a line of best fit to the visualisation to show the general trend
    labs(
    title = "Gross US Earnings Can Be Consistently Predicted by the Movie's Budget",
    subtitle = "Relationship Between a Movie's Budget and Total US Earnings",
    x = "Movie's Budget",
    y = "Gross Earnings in the US Box Office, not Inflation-adjusted",
    caption = 'Source: Kaggle IMDB 5000 movie dataset')

In general, IMDB ratings could be used to predict the gross earnings of movies in the US - this positive relationship is particularly visible in the Action, Adventure, and Comedy genres. However, data in the ‘movies’ dataset is not uniformly distributed across genres, with some genres containing only a few data points, which prohibits a meaningful analysis of the relationship between IMDB ratings and gross earnings.

Outliers are most visible in the Action, Drama, and Family genres. Some genres (Biography, Crime) only contain movies with a minimum rating of approximately 5.0 - it might be the case that movies of these types generally receive higher ratings. Lastly, genres such as Fantasy and Sci-Fi do not exhibit any relationship between IMDB ratings and gross earnings.

# Creating a scatterplot of revenue and rating, faceted by genre
ggplot(movies) +
  aes(x = rating, y = gross, color = genre) +
  facet_wrap(vars(genre)) +
  theme(legend.position = 'none') +
  geom_point() +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Gross earnings could be predicted with IMDB ratings, with some genres lacking enough data points to identify a positive relationship",
    subtitle = 'Relationship Between IMDB Ratings and Total US Earnings',
    x = "IMDB Rating",
    y = "Gross Earnings in the US Box Office, not Inflation-adjusted",
    caption = 'Source: Kaggle IMDB 5000 movie dataset',
  )

3 Returns of financial stocks

Before we can analyse the returns of stocks, we first decide which companies we want to analyse.

nyse <- read_csv(here::here("data","nyse.csv"))
glimpse(nyse)
## Rows: 508
## Columns: 6
## $ symbol        <chr> "MMM", "ABB", "ABT", "ABBV", "ACN", "AAP", "AFL", "A", "…
## $ name          <chr> "3M Company", "ABB Ltd", "Abbott Laboratories", "AbbVie …
## $ ipo_year      <chr> "n/a", "n/a", "n/a", "2012", "2001", "n/a", "n/a", "1999…
## $ sector        <chr> "Health Care", "Consumer Durables", "Health Care", "Heal…
## $ industry      <chr> "Medical/Dental Instruments", "Electrical Products", "Ma…
## $ summary_quote <chr> "https://www.nasdaq.com/symbol/mmm", "https://www.nasdaq…

Based on this dataset, we show the number of companies per sector below.

comp_per_sector <- nyse %>% 
  group_by(sector) %>% 
  summarise(companies = n()) %>% 
  slice_max(companies, n=100) %>% 
  mutate(sector = fct_reorder(sector, companies))
comp_per_sector 
## # A tibble: 12 × 2
##    sector                companies
##    <fct>                     <int>
##  1 Finance                      97
##  2 Consumer Services            79
##  3 Public Utilities             60
##  4 Capital Goods                45
##  5 Health Care                  45
##  6 Energy                       42
##  7 Technology                   40
##  8 Basic Industries             39
##  9 Consumer Non-Durables        31
## 10 Miscellaneous                12
## 11 Transportation               10
## 12 Consumer Durables             8
ggplot(comp_per_sector) +
  aes(x = companies, y = sector) +
  geom_col() +
  labs(
    title = 'Finance is the Largest Sector, Consumer Durables the Smallest',
    subtitle = "Companies Per Sector",
    x = 'Companies',
    y = 'Sector',
    caption = 'Source: Federal Reserve Economic Database',
  ) 

nyse
## # A tibble: 508 × 6
##    symbol name                             ipo_year sector       indus…¹ summa…²
##    <chr>  <chr>                            <chr>    <chr>        <chr>   <chr>  
##  1 MMM    3M Company                       n/a      Health Care  Medica… https:…
##  2 ABB    ABB Ltd                          n/a      Consumer Du… Electr… https:…
##  3 ABT    Abbott Laboratories              n/a      Health Care  Major … https:…
##  4 ABBV   AbbVie Inc.                      2012     Health Care  Major … https:…
##  5 ACN    Accenture plc                    2001     Miscellaneo… Busine… https:…
##  6 AAP    Advance Auto Parts Inc           n/a      Consumer Se… Other … https:…
##  7 AFL    Aflac Incorporated               n/a      Finance      Accide… https:…
##  8 A      Agilent Technologies, Inc.       1999     Capital Goo… Biotec… https:…
##  9 AEM    Agnico Eagle Mines Limited       n/a      Basic Indus… Precio… https:…
## 10 APD    Air Products and Chemicals, Inc. n/a      Basic Indus… Major … https:…
## # … with 498 more rows, and abbreviated variable names ¹​industry,
## #   ²​summary_quote

We are choosing some stocks and downloading their ticker data to analyse.

# Notice the cache=TRUE argument in the chunk options. Because getting data is time consuming, 
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd

myStocks <- c("MMM","ABT","ACN","ANTM","AGR","TSLA","SPY" ) %>%
  tq_get(get  = "stock.prices",
         from = "2011-01-01",
         to   = "2022-08-31") %>%
  group_by(symbol) 

glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 16,362
## Columns: 8
## Groups: symbol [6]
## $ symbol   <chr> "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM"…
## $ date     <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open     <dbl> 86.8, 87.0, 86.3, 86.9, 86.6, 85.7, 87.3, 88.0, 88.5, 87.7, 8…
## $ high     <dbl> 87.3, 87.3, 87.9, 87.2, 87.3, 87.3, 88.3, 88.8, 88.9, 88.1, 8…
## $ low      <dbl> 86.7, 86.3, 86.1, 85.6, 85.9, 85.7, 87.3, 87.9, 87.8, 87.4, 8…
## $ close    <dbl> 86.8, 86.7, 86.7, 86.1, 86.2, 87.2, 87.7, 88.7, 88.0, 88.1, 8…
## $ volume   <dbl> 2632800, 2644100, 4081300, 3452600, 3355500, 3475200, 3024400…
## $ adjusted <dbl> 62.3, 62.2, 62.2, 61.9, 61.9, 62.6, 63.0, 63.7, 63.2, 63.3, 6…

Financial performance analysis depends on returns. If I buy a stock today for $100 and sell it tomorrow for $101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.
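The worked example above, in code; the log (continuously compounded) return used for the daily series comes out slightly smaller than the arithmetic return:

```r
buy  <- 100
sell <- 101.75

arithmetic_return <- sell / buy - 1   # one-day simple return: 1.75%
log_return        <- log(sell / buy)  # log return: slightly less than 1.75%
```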

#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns")

#calculate monthly  returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns")

#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns")

3.0.1 Summary Statistics For Monthly Returns

summary_monthly_returns <- myStocks_returns_monthly %>% 
  group_by(symbol) %>% 
  summarise(min_return = min(monthly_returns),
            max_return = max(monthly_returns),
            median_return = median(monthly_returns),
            mean_return = mean(monthly_returns),
            sd_return = sd(monthly_returns))

summary_monthly_returns
## # A tibble: 6 × 6
##   symbol min_return max_return median_return mean_return sd_return
##   <chr>       <dbl>      <dbl>         <dbl>       <dbl>     <dbl>
## 1 ABT        -0.152      0.172       0.0116      0.0140     0.0539
## 2 ACN        -0.145      0.160       0.0216      0.0164     0.0613
## 3 AGR        -0.112      0.186       0.00272     0.00838    0.0521
## 4 MMM        -0.150      0.113       0.0121      0.00651    0.0551
## 5 SPY        -0.125      0.127       0.0146      0.0106     0.0404
## 6 TSLA       -0.224      0.811       0.0117      0.0501     0.177

To analyse the distribution of the returns, we plot a density plot for each of the stocks.

myStocks_returns_monthly %>% 
  ggplot() +
  aes(x = monthly_returns, fill = symbol) +
  geom_density() +
  labs(
    title = 'TSLA Has the Widest Spread',
    subtitle = 'Density Plot - Stockwise Monthly Return',
    x = 'Return',
    y = 'Density',
    caption = 'Source: Federal Reserve Economic Database'
  ) +
  facet_wrap(~symbol, ncol = 1) +
  theme_bw() +
  theme(legend.position = "none") +
  scale_x_continuous(labels = scales::percent)

The graph shows that TSLA has the widest spread, suggesting it has the greatest deviation from the mean and is thus the most risky stock. The remaining stocks have a similar spread; we need further analysis to determine which stock is the least risky.

Finally, we plot the expected monthly return against risk for each stock.

summary_monthly_returns %>% 
  ggplot() +
  aes(x = mean_return, y = sd_return, label = symbol, colour = symbol) +
  ggrepel::geom_text_repel() +
  geom_point() +
  coord_flip() +
  theme(legend.position = "none") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Tesla Stock Involves the Highest Risk and Return",
    subtitle = "Risk-Return Analysis",
    x = "Expected Monthly Return",
    y = "Risk (Standard Deviation of Monthly Returns)",
    caption = 'Source: Federal Reserve Economic Database'
  )

According to our analysis, TSLA is the riskiest stock, with a standard deviation of monthly returns of around 18%, and it also offers the highest expected monthly return, at about 5%. At the other end of the spectrum, SPY carries the least risk, with a standard deviation of around 4%, while still offering an expected monthly return of about 1%.

4 On your own: Spotify

We have downloaded a large dataset of Spotify songs and are going to analyse what makes a track popular.

spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

The data dictionary can be found below.

variable class description
track_id character Song unique ID
track_name character Song Name
track_artist character Song Artist
track_popularity double Song Popularity (0-100) where higher is better
track_album_id character Album unique ID
track_album_name character Song album name
track_album_release_date character Date when album released
playlist_name character Name of playlist
playlist_id character Playlist ID
playlist_genre character Playlist genre
playlist_subgenre character Playlist subgenre
danceability double Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy double Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key double The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness double The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode double Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness double Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness double A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness double Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness double Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence double A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo double The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms double Duration of the song in milliseconds

Before analysing this dataset further, we need to examine the distribution of track popularity on Spotify. The histogram below shows the distribution.

ggplot(spotify_songs) +
  aes(x = track_popularity) +
  geom_histogram() +
  labs(
    title = 'Many songs have a popularity of 0 on spotify',
    subtitle = 'A histogram of the popularity rating of spotify songs',
    x = 'Popularity rating',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

Track popularity looks roughly normally distributed, apart from a large spike of tracks with a popularity close to 0 that does not fit the rest of the distribution. A plausible explanation is the many small artists uploading songs that few people listen to, like someone recording a new podcast in their garage. As in the music industry in general, not everyone makes it.
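The size of that spike is easy to quantify directly (a minimal sketch, assuming the tidyverse is loaded and `spotify_songs` is in memory):

```r
# how many tracks sit at exactly zero popularity, and what share of
# the dataset they represent
spotify_songs %>%
  summarize(
    n_zero = sum(track_popularity == 0),
    share_zero = mean(track_popularity == 0)
  )
```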

To start determining why some songs are more popular than others, we first look at the distributions and summary statistics of the variables, checking in particular whether the audio features look normally distributed.

skim(spotify_songs) # seeing if we can determine whether the audio features look like a normal distribution
Data summary
Name spotify_songs
Number of rows 32833
Number of columns 23
_______________________
Column type frequency:
character 10
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
track_id 0 1 22 22 0 28356 0
track_name 5 1 1 144 0 23449 0
track_artist 5 1 2 69 0 10692 0
track_album_id 0 1 22 22 0 22545 0
track_album_name 5 1 1 151 0 19743 0
track_album_release_date 0 1 4 10 0 4530 0
playlist_name 0 1 6 120 0 449 0
playlist_id 0 1 22 22 0 471 0
playlist_genre 0 1 3 5 0 6 0
playlist_subgenre 0 1 4 25 0 24 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
track_popularity 0 1 42.48 24.98 0.0 24.00 45.00 62.00 1.00e+02 ▆▆▇▆▁
danceability 0 1 0.65 0.15 0.0 0.56 0.67 0.76 9.80e-01 ▁▁▃▇▃
energy 0 1 0.70 0.18 0.0 0.58 0.72 0.84 1.00e+00 ▁▁▅▇▇
key 0 1 5.37 3.61 0.0 2.00 6.00 9.00 1.10e+01 ▇▂▅▅▆
loudness 0 1 -6.72 2.99 -46.5 -8.17 -6.17 -4.64 1.27e+00 ▁▁▁▂▇
mode 0 1 0.57 0.50 0.0 0.00 1.00 1.00 1.00e+00 ▆▁▁▁▇
speechiness 0 1 0.11 0.10 0.0 0.04 0.06 0.13 9.20e-01 ▇▂▁▁▁
acousticness 0 1 0.18 0.22 0.0 0.02 0.08 0.26 9.90e-01 ▇▂▁▁▁
instrumentalness 0 1 0.08 0.22 0.0 0.00 0.00 0.00 9.90e-01 ▇▁▁▁▁
liveness 0 1 0.19 0.15 0.0 0.09 0.13 0.25 1.00e+00 ▇▃▁▁▁
valence 0 1 0.51 0.23 0.0 0.33 0.51 0.69 9.90e-01 ▃▇▇▇▃
tempo 0 1 120.88 26.90 0.0 99.96 121.98 133.92 2.39e+02 ▁▂▇▂▁
duration_ms 0 1 225799.81 59834.01 4000.0 187819.00 216000.00 253585.00 5.18e+05 ▁▇▇▁▁
ggplot(spotify_songs) +
  aes(x = acousticness) +
  geom_histogram() +
  labs(
    title = 'Acousticness is exponentially distributed',
    subtitle = 'Distribution of acousticness',
    x = 'Acousticness',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

# acousticness looks like an exponential distribution, with most songs not having any acousticness at all.

ggplot(spotify_songs) +
  aes(x = liveness) +
  geom_histogram() +
  labs(
    title = 'Liveness is skewed right',
    subtitle = 'Distribution of Liveness',
    x = 'Liveness',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

# Liveness is right-skewed: most songs have some liveness, but only a few have a lot.

ggplot(spotify_songs) +
  aes(x = speechiness) +
  geom_histogram() +
  labs(
    title = 'Speechiness is exponentially distributed',
    subtitle = 'Distribution of Speechiness',
    x = 'Speechiness',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

# Speechiness looks exponentially distributed, similar to acousticness, but very few songs have exactly 0 speechiness.

ggplot(spotify_songs) +
  aes(x = instrumentalness) +
  geom_histogram() +
  labs(
    title = 'Instrumentalness is skewed right',
    subtitle = 'Distribution of Instrumentalness',
    x = 'Instrumentalness',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

# Instrumentalness is right-skewed: the vast majority of songs have an instrumentalness of 0, with a small second bump around 0.8.

ggplot(spotify_songs) +
  aes(x = energy) +
  geom_histogram() +
  labs(
    title = 'Energy is normally distributed with a cutoff',
    subtitle = 'Distribution of Energy',
    x = 'Energy',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

# Energy looks roughly normal with a mean around 0.7, but it is truncated at 1, so it is not truly normal.

ggplot(spotify_songs) +
  aes(x = loudness) +
  geom_histogram() +
  labs(
    title = 'Loudness is normally distributed with a small standard deviation',
    subtitle = 'Distribution of Loudness',
    x = 'Loudness',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

# Loudness looks approximately normal with a small standard deviation, though slightly left-skewed.

ggplot(spotify_songs) +
  aes(x = danceability) +
  geom_histogram() +
  labs(
    title = 'Danceability is normally distributed',
    subtitle = 'Distribution of Danceability',
    x = 'Danceability',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

# Danceability looks approximately normal, slightly left-skewed.

ggplot(spotify_songs) +
  aes(x = valence) +
  geom_histogram() +
  labs(
    title = 'Valence is normally distributed with a cutoff',
    subtitle = 'Distribution of Valence',
    x = 'Valence',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

# Valence looks like a normal distribution truncated at 0 and 1.

ggplot(spotify_songs) +
  aes(x = tempo) +
  geom_histogram()  +
  labs(
    title = 'Tempo is bimodally distributed',
    subtitle = 'Distribution of Tempo',
    x = 'Tempo',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

# Tempo looks like a bimodal distribution: a mixture of two normal distributions with different means.

ggplot(spotify_songs) +
  aes(x = duration_ms) +
  geom_histogram()  +
  labs(
    title = 'Duration is unimodal but right-skewed',
    subtitle = 'Distribution of Duration',
    x = 'Duration',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

# Duration is unimodal but right-skewed, so not strictly normal.

ggplot(spotify_songs) +
  aes(x = key) +
  geom_histogram()  +
  labs(
    title = 'Key can only be whole numbers',
    subtitle = 'Distribution of Key',
    x = 'Key',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

# Key takes only whole-number values (pitch classes 0-11), rather than a continuous scale.

ggplot(spotify_songs) +
  aes(x = mode) +
  geom_histogram() +
  labs(
    title = 'Mode is either 1 (Major) or 0 (Minor)',
    subtitle = 'Distribution of Mode',
    x = 'Mode',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

# Mode takes only the values 0 (minor) and 1 (major), nothing in between.

We choose valence and danceability to start analysing what determines track popularity.

ggplot(spotify_songs) +
  aes(valence, track_popularity) +
  geom_point() +
  geom_smooth() +
  labs(
    title = 'Valence and track popularity appear unrelated',
    subtitle = 'Scatterplot of Valence and popularity',
    x = 'Valence',
    y = 'Popularity', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

cor(spotify_songs$valence, spotify_songs$track_popularity)
## [1] 0.0332

Track popularity and valence do not appear to have a clear relationship: the points are spread across the whole scatter plot and the correlation coefficient is very low.
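Note that with roughly 33,000 tracks, even a tiny correlation can be statistically significant. A quick `cor.test` makes this explicit, so the relevant takeaway is the effect size rather than the p-value:

```r
# test whether the correlation between valence and popularity is
# distinguishable from zero; with n ~ 33,000 the confidence interval
# will be narrow, but the estimate itself remains tiny
cor.test(spotify_songs$valence, spotify_songs$track_popularity)
```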

ggplot(spotify_songs) +
  aes(danceability, track_popularity) +
  geom_point() +
  geom_smooth() +
  labs(
    title = 'High danceability is related with a higher popularity',
    subtitle = 'Scatterplot of danceability and popularity',
    x = 'Danceability',
    y = 'Popularity', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

cor(spotify_songs$danceability, spotify_songs$track_popularity)
## [1] 0.0647

The correlation coefficient is again very low, although higher danceability is associated with slightly higher popularity. It seems that neither of these features does much to predict the popularity of songs.
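Rather than checking features one at a time, we could rank every audio feature by its correlation with popularity in one pass. A sketch, assuming the tidyverse is loaded and relying on the `danceability` through `duration_ms` columns being contiguous in the dataset:

```r
# correlation of each numeric audio feature with track popularity,
# sorted by absolute strength
spotify_songs %>%
  select(track_popularity, danceability:duration_ms) %>%
  pivot_longer(-track_popularity, names_to = "feature") %>%
  group_by(feature) %>%
  summarize(r = cor(value, track_popularity)) %>%
  arrange(desc(abs(r)))
```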

Next we examine whether modality (major vs. minor) matters for danceability and track popularity.

mode_analysis <- spotify_songs %>%
  group_by(mode) %>%
  summarize(avg_danceability = mean(danceability),
            median_danceability = median(danceability),
            avg_popularity = mean(track_popularity),
            median_popularity = median(track_popularity))

mode_analysis
## # A tibble: 2 × 5
##    mode avg_danceability median_danceability avg_popularity median_popularity
##   <dbl>            <dbl>               <dbl>          <dbl>             <dbl>
## 1     0            0.665               0.68            42.2                45
## 2     1            0.647               0.663           42.7                46
library(BSDA)

spotify_songs %>%
  mutate(mode = factor(mode)) %>%
ggplot() +
  aes(x = danceability, fill = mode) +
  geom_histogram(alpha = 0.2, position = 'identity') +
  labs(
    title = 'Danceability does not differ between major and minor',
    subtitle = 'Histograms of danceability for major (1) and minor (0) songs',
    x = 'Danceability',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

mode1 <- filter(spotify_songs, mode == 1)
mode0 <- filter(spotify_songs, mode == 0)

There appears to be no significant difference in danceability between major and minor modes, as the means appear the same in both graphs and the spread does not look different.
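The eye test can be backed up with a formal test; since `library(BSDA)` is loaded above and `mode1`/`mode0` are already defined, a two-sample z-test is a natural choice. A minimal sketch, plugging in the sample standard deviations as `sigma.x` and `sigma.y` (a reasonable approximation at this sample size):

```r
# two-sample z-test: is mean danceability different between major
# (mode == 1) and minor (mode == 0) tracks?
z.test(
  x = mode1$danceability,
  y = mode0$danceability,
  sigma.x = sd(mode1$danceability),
  sigma.y = sd(mode0$danceability)
)
```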

spotify_songs %>%
  mutate(mode = factor(mode)) %>%
ggplot() +
  aes(x = track_popularity, fill = mode) +
  geom_histogram(alpha = 0.2, position = 'identity') +
  labs(
    title = 'Track popularity does not differ between major and minor',
    subtitle = 'Histograms of Track popularity for major (1) and minor (0) songs',
    x = 'Track popularity',
    y = 'Count', 
    caption = 'Tidy Tuesday Spotify Songs dataset (accessed 04-09-2022)'
  )

There appears to be no significant difference in track popularity between major and minor modes either. Based on these analyses, we cannot yet determine which audio features drive the popularity of songs.

5 Challenge 1: Replicating a chart

We have created a graph showing the cumulative % change in median rental prices for 0-, 1-, and 2-bed flats between 2000 and 2018 for the top twelve cities in the Bay Area.

base_years <- rent %>%
  group_by(beds, city) %>%
  top_n(-1, year) %>% # keep each (beds, city) group's earliest year
  summarize(base_rent = median(price))


rent_cum_change <- rent %>%
  filter(beds < 3,
         city %in% top12_cities) %>%
  group_by(beds, city, year) %>%
  summarize(med_rent = median(price)) %>% 
  left_join(base_years, by = c('beds','city')) %>%
  mutate(index = med_rent/base_rent * 100)

head(rent_cum_change)
## # A tibble: 6 × 6
## # Groups:   beds, city [1]
##    beds city      year med_rent base_rent index
##   <dbl> <chr>    <dbl>    <dbl>     <dbl> <dbl>
## 1     0 berkeley  2001    1148.     1148. 100  
## 2     0 berkeley  2002    1000      1148.  87.1
## 3     0 berkeley  2003     888.     1148.  77.3
## 4     0 berkeley  2004     825      1148.  71.9
## 5     0 berkeley  2005     850      1148.  74.1
## 6     0 berkeley  2006     900      1148.  78.4
ggplot(rent_cum_change) +
  aes(x = year, y = index, color = city) +
  geom_line(show.legend = FALSE) + 
  facet_grid( beds ~ city ) +
  labs(
    title = 'Cumulative % change in 0, 1, and 2-bed rentals in Bay Area',
    subtitle = '2000-2018',
    x = NULL,
    y = NULL,
    caption = 'Tidy Tuesday San Francisco Rents dataset (accessed 04-09-2022)'
  )

6 Challenge 2: 2016 California Contributors plots

We are looking to create a plot analysing the origin of contributions to the campaigns of Hillary Clinton and Donald Trump in the 2016 election. We do this by joining a dataset mapping cities to their zip codes with a dataset of the contributions to each candidate.

library(tidytext)
# Make sure you use vroom() as it is significantly faster than read.csv()
CA_contributors_2016 <- vroom::vroom(here::here("data","CA_contributors_2016.csv"))
zipcodes <- vroom::vroom(here::here("data","zip_code_database.csv"))
head(CA_contributors_2016)
## # A tibble: 6 × 4
##   cand_nm                 contb_receipt_amt   zip contb_date
##   <chr>                               <dbl> <dbl> <date>    
## 1 Clinton, Hillary Rodham              50   94939 2016-04-26
## 2 Clinton, Hillary Rodham             200   93428 2016-04-20
## 3 Clinton, Hillary Rodham               5   92337 2016-04-02
## 4 Trump, Donald J.                     48.3 95334 2016-11-21
## 5 Sanders, Bernard                     40   93011 2016-03-04
## 6 Trump, Donald J.                    244.  95826 2016-11-24
head(zipcodes)
## # A tibble: 6 × 16
##   zip   type     primary_…¹ accep…² unacc…³ state county timez…⁴ area_…⁵ latit…⁶
##   <chr> <chr>    <chr>      <chr>   <chr>   <chr> <chr>  <chr>     <dbl>   <dbl>
## 1 00501 UNIQUE   Holtsville <NA>    I R S … NY    Suffo… Americ…     631    40.8
## 2 00544 UNIQUE   Holtsville <NA>    Irs Se… NY    Suffo… Americ…     631    40.8
## 3 00601 STANDARD Adjuntas   <NA>    Colina… PR    Adjun… Americ…  787939    18.2
## 4 00602 STANDARD Aguada     <NA>    Alts D… PR    <NA>   <NA>        787    18.4
## 5 00603 STANDARD Aguadilla  Ramey   Bda Ca… PR    Aguad… Americ…     787    18.4
## 6 00604 PO BOX   Aguadilla  Ramey   <NA>    PR    <NA>   <NA>         NA    18.4
## # … with 6 more variables: longitude <dbl>, world_region <chr>, country <chr>,
## #   decommissioned <dbl>, estimated_population <dbl>, notes <chr>, and
## #   abbreviated variable names ¹​primary_city, ²​acceptable_cities,
## #   ³​unacceptable_cities, ⁴​timezone, ⁵​area_codes, ⁶​latitude
# inspecting the dataframes to find out what the columns look like & how to join them later
# preparing zipcodes for the join: convert zip to double to match
# CA_contributors_2016$zip (note this drops leading zeros, which is
# safe here because California zip codes start with 9)
zipcodes <- zipcodes %>%
  mutate(zip = as.double(zip)) %>%
  select(zip, primary_city) # keeping only the columns needed for the join

# combining the 2 datasets by zip code
contributors_by_city <- left_join(CA_contributors_2016, zipcodes, by = 'zip')
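Coercing zip codes to numbers is a common source of join bugs (a code like "00501" becomes 501). California codes start with 9, so the join here is safe, but a more defensive variant pads the numeric zips and joins on character keys instead; a sketch (the `zip_lookup` and `contributors_alt` names are just for illustration):

```r
# defensive variant: keep zip as a zero-padded character key instead of
# coercing the lookup table to double
zip_lookup <- vroom::vroom(here::here("data", "zip_code_database.csv")) %>%
  select(zip, primary_city)

contributors_alt <- CA_contributors_2016 %>%
  mutate(zip = sprintf("%05d", zip)) %>% # 501 -> "00501"
  left_join(zip_lookup, by = "zip")
```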

#taking a look to see if the join was successful
head(contributors_by_city)
## # A tibble: 6 × 5
##   cand_nm                 contb_receipt_amt   zip contb_date primary_city
##   <chr>                               <dbl> <dbl> <date>     <chr>       
## 1 Clinton, Hillary Rodham              50   94939 2016-04-26 Larkspur    
## 2 Clinton, Hillary Rodham             200   93428 2016-04-20 Cambria     
## 3 Clinton, Hillary Rodham               5   92337 2016-04-02 Fontana     
## 4 Trump, Donald J.                     48.3 95334 2016-11-21 Livingston  
## 5 Sanders, Bernard                     40   93011 2016-03-04 Camarillo   
## 6 Trump, Donald J.                    244.  95826 2016-11-24 Sacramento
# top 10 candidates by total amount raised
candidates <- contributors_by_city %>%
  group_by(cand_nm) %>% 
  summarise(total_contr = sum(contb_receipt_amt)) %>%
  top_n(10, total_contr) %>%
  pull(cand_nm)

total_contb_by_city <- contributors_by_city %>%
  group_by(cand_nm, primary_city) %>%
  summarize(total_contr = sum(contb_receipt_amt)) %>% 
  ungroup() %>%
  group_by(cand_nm) %>%
  top_n(10, total_contr) %>% # top 10 cities per candidate
  ungroup()

candidates
##  [1] "Bush, Jeb"                 "Carson, Benjamin S."      
##  [3] "Clinton, Hillary Rodham"   "Cruz, Rafael Edward 'Ted'"
##  [5] "Fiorina, Carly"            "Kasich, John R."          
##  [7] "Paul, Rand"                "Rubio, Marco"             
##  [9] "Sanders, Bernard"          "Trump, Donald J."
total_contb_by_city %>% 
  filter(cand_nm %in% candidates) %>% # selecting relevant candidates
  # ordering the values by size of contribution
  mutate(cand_nm = as.factor(cand_nm),
        primary_city = reorder_within(primary_city, total_contr, cand_nm)) %>%
  ggplot() + # plotting the graph
  aes(primary_city, total_contr, fill = cand_nm) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~cand_nm, scales = 'free') +
  coord_flip() +
  scale_x_reordered() +
  labs(x = NULL,
       y = "Amount raised",
       title = "Where did candidates raise the most money?") +
  scale_y_continuous(labels = scales::dollar_format())

7 Details